Zero-inflation in the Poisson lognormal family

Statistiques aux sommets, Rochebrune

Bastien Batardière

PhD - UMR MIA Paris-Saclay

Julien Chiquet

UMR MIA Paris-Saclay

François Gindraud

IR INRIA

Mahendra Mariadassou

MaIAGE

March 25, 2024

Motivation: multivariate count data

Routinely gathered in ecology/microbiology/genomics

Data tables

  • Abundances: read counts of species/transcripts \(j\) in sample \(i\)
  • Covariates: value of environmental variable \(k\) in sample \(i\)
  • Offsets: sampling effort for species/transcripts \(j\) in sample \(i\)

Goals

  • understand environmental effects (regression, classification)
  • exhibit patterns of diversity (clustering, dimension reduction)
  • understand between-species interactions (covariance selection)

Illustration: microcosm data set (Mariadassou et al. 2023)

  • Study carried out at INRAE “Domaine Expérimental du Pin”
  • microbiotas of 44 lactating cows at 4 body site (mouth, nose, vagina, milk) \(\times\) 4 time points
  • Abundances of \(\approx\) 1200 “species” (Amplicon Sequence Variants) with known taxonomy

Model fo multivariate count data

The Poisson Lognormal model (PLN) (Aitchison and Ho 1989)

PLN is a multivariate generalized linear model, where

  • the counts \(\mathbf{Y}_i\) are the response variables
  • the main effect is due to a linear combination of the covariates \(\mathbf{x}_i\)
  • a vector of offsets \(\mathbf{o}_i\) can be specified for each sample

\[ \mathbf{Y}_i | \mathbf{Z}_i \sim \mathcal{P}\left(\exp\{\mathbf{Z}_i\}\right), \qquad \mathbf{Z}_i \sim \mathcal{N}({\mathbf{o}_i + \mathbf{x}_i^\top\mathbf{B}},\boldsymbol\Sigma). \]

Estimation

Parameters \(\mathbf{\theta} = (\mathbf{B}, \boldsymbol\Sigma)\) are estimated by inference techniques from latent-variable models

Analysis of microcosm with standard PLN

Retrieve microcosm data

microcosm <- readRDS("microcosm_reduced.rds")
microcosm$Abundance <- microcosm$Abundance[, colMeans(microcosm$Abundance > 0) > 0.05]
microcosm$site_time <- droplevels(microcosm$site_time)
sum(microcosm$Abundance == 0) / length(microcosm$Abundance)
[1] 0.9033976

Still 90% of zeros !

Assess environnemental effects

PLN           <- PLN(Abundance ~ 1             + offset(log(Offset)), data = microcosm)
PLN_site      <- PLN(Abundance ~ 0 + site      + offset(log(Offset)), data = microcosm)
PLN_time      <- PLN(Abundance ~ 0 + time      + offset(log(Offset)), data = microcosm)
PLN_site_time <- PLN(Abundance ~ 0 + site_time + offset(log(Offset)), data = microcosm)
Model selection criteria
nb_param loglik BIC ICL
PLN 33929 -220146.1 -335526.4 -856215.2
PLN site 34706 -217792.1 -335814.7 -856263.3
PLN time 34706 -218444.9 -336467.6 -854845.5
PLN site * time 37555 -213942.0 -341653.1 -859927.5



Handling
zeros in
multivariate
count tables

A zero-inflated PLN

Mixture of PLN and Bernoulli distribution

Use two latent vectors \(\mathbf{W}_i\) and \(\mathbf{Z}_i\) to model excess of zeroes and dependence structure

\[\begin{array}{rrl} \text{PLN latent space} & \boldsymbol{Z}_i = (Z_{ij})_{j=1\dots p} & \sim \mathcal{N}(\mathbf{x}_i^\top \mathbf{B}, \mathbf{\Sigma}), \\[1.5ex] \text{excess of zeros} & \boldsymbol{W}_{i} = (W_{ij})_{j=1\dots p} & \sim \otimes_{j=1}^p \cal B(\pi_{ij}), \\[1.5ex] \text{observation space} & Y_{ij} \, | \, W_{ij}, Z_{ij} & \sim^\text{indep} W_{ij}\delta_0 + (1-W_{ij})\mathcal{P} \left(\exp\{o_{ij} + Z_{ij}\}\right),\\ \end{array}\]

\(\rightsquigarrow\) Better handling of zeros +additional interpretable parameters

Basic properties

Letting \(A_{ij} \triangleq \exp \left( o_{ij} + \mu_{ij} + \sigma_{jj}/2\right)\) with \(\mu_{ij} = \mathbf{x}_i^\intercal \mathbf{B}_j\), then

  • \(\mathbb{E}\left(Y_{ij}\right) = (1-\pi_{ij}) A_{ij} \leq A_{ij}\) (PLN’s mean),
  • \(\mathbb{V}\left(Y_{ij}\right) = (1-\pi_{ij})A_{ij} + (1-\pi_{ij})A_{ij}^2 \left( e^{\sigma_{jj}} - (1-\pi_{ij}) \right)\).

ZI-PLN: refinements

Modeling of the pure zero component

\[\begin{align*} \pi_{ij} & = \pi \in [0,1] & \text{(single global parameter)} \\ \pi_{ij} & = \pi_j \in [0,1] & \text{(species dependent)} \\ \pi_{ij} & = \pi_i \in [0,1] & \text{(site dependent)} \\ \pi_{ij} & = \mathrm{logit}^{-1}( \boldsymbol X^{0}\boldsymbol B^0)_{ij}, ~ \boldsymbol X^0 \in \mathbb R^{n \times d_0}, ~ \boldsymbol B^0 \in \mathbb R^{d_0\times p} & \text{(site covariates)} \\ \pi_{ij} & = \mathrm{logit}^{-1}(\bar{\boldsymbol{B}}^0 \bar{\boldsymbol{X}}^{0})_{ij}, ~ \bar{\boldsymbol{B}}^0 \in \mathbb R^{n \times d_0}, ~ \bar{\boldsymbol X}^0 \in \mathbb R^{d_0\times p} & \text{(species covariates)} \end{align*}\]

Proposition 1 (Identifiability of ZI-PLN)  

  • The “single global parameter” ZI-PLN model with parameter \(\mathbf{\theta} = (\mathbf{\Omega}, \mathbf{\mu}, \mathbf{\pi})\) and parameter space \(\mathbb{S}_p^{++} \times \mathbb{R}^p \times (0,1)^{p}\) is identifiable (moment-based proof)
  • The site-covariates zero-inflation model with parameter \(\mathbf{\theta} = (\mathbf{\Omega}, \mathbf{B}, \mathbf{B}^0)\) and parameter space \(\mathbb{S}_p^{++} \times \mathcal{M}_{p,d}(\mathbb{R}) \times \mathcal{M}_{p,d}(\mathbb{R})\) is identifiable if and only if both \(n\times d\) and \(n \times d_0\) matrices of covariates \(\mathbf{X}\) and \(\mathbf{X}^0\) are full rank.

Variational
Inference

Standard mean-field

Variational approximation breaks all dependencies

\[\begin{equation*} p(\mathbf{Z}_i, \mathbf{W}_i | \mathbf{Y}_i) \approx q_{\psi_i}(\mathbf{Z}_i, \mathbf{W}_i) \triangleq q_{\psi_i}(\mathbf{Z}_i) q_{\psi_i}(\mathbf{W}_i) = \otimes_{j=1}^p q_{\psi_i}(Z_{ij}) q_{\psi_i}(W_{ij}) \end{equation*}\]

with Gaussian and Bernoulli distributions for \(Z_{ij}\) and \(W_{ij}\), then

\[\begin{equation*} q_{\psi_i}(\mathbf{Z}_i, \mathbf{W}_i) = \otimes_{j=1}^p \mathcal N\left( M_{ij}, S_{ij}^2\right) \mathcal B\left(\rho_{ij} \right) \end{equation*}\]

Variational lower bound

Let \(\theta = (\mathbf{B}, \mathbf{B}^0, \mathbf{\Sigma})\) and \(\psi= (\mathbf{M}, \mathbf{S}, \mathbf{R})\), then

\[\begin{align*} J(\theta, \psi) & = \log p_\theta(\mathbf{Y}) - KL(p_\theta(. | \mathbf{Y}) \| q_\psi(.) ) \\ & = \mathbb{E}_{q_{\psi}} \log p_\theta(\mathbf{Z}, \mathbf{W}, \mathbf{Y}) - \mathbb{E}_{q_{\psi}} \log q_\psi(\mathbf{Z}, \mathbf{W}) \\ & = \mathbb{E}_{q_{\psi}} \log p_\theta (\mathbf{Y} | \mathbf{Z}, \mathbf{W}) + \mathbb{E}_{q_{\psi}} \log p_\theta (\mathbf{Z}) + \mathbb{E}_{q_{\psi}} \log p_\theta (\mathbf{W}) \\ & \qquad - \mathbb{E}_{q_{\psi}} \log q_{\psi} (\mathbf{Z}) - \mathbb{E}_{q_{\psi}} \log q_{\psi} (\mathbf{W}) \\ \end{align*}\]

Property: concave in each element of \(\theta, \psi\).

Sparse regularization

Recall that \(\theta = (\mathbf{B}, \mathbf{B}^0, \mathbf{\Omega} = \mathbf{\Sigma}^{-1})\). Sparsity allows to control the number of parameters:

\[\begin{equation*} \arg\min_{\theta, \psi} J(\theta, \psi) + \lambda_1 \| \mathbf{B} \|_1 + \lambda_2 \| \mathbb{\Omega} \|_1 \color{#ddd}{ \left( + \lambda_1 \| \mathbf{B}^0 \|_1 \right)} \end{equation*}\]

Alternate optimization

  • (Stochastic) Gradient-descent on \(\mathbf{B}^0, \mathbf{M}, \mathbf{S}\)
  • Closed-form for posterior probabilities \(\mathbf{R}\)
  • Inverse covariance \(\mathbf{\Omega}\)
    • if \(\lambda_2=0\), \(\hat{\mathbf{\Sigma}} = n^{-1} \left[ (\mathbf{M} - \mathbf{XB})^\top(\mathbf{M} - \mathbf{XB}) + \bar{\mathbf{S}}^2 \right]\)
    • if \(\lambda_2 > 0\), \(\ell_1\) penalized MLE ( \(\rightsquigarrow\) Graphical-Lasso with \(\hat{\mathbf{\Sigma}}\) as input)
  • PLN regression coefficient \(\mathbf{B}\)
    • if \(\lambda_1=0\), \(\hat{\mathbf{B}} = [\mathbf{X}^\top \mathbf{X}]^{-1} \mathbf{X}^\top \mathbf{M}\)
    • if \(\lambda_1 > 0\), vectorize and solve a \(\ell_1\) penalized least-squared problem

Initialize With univariate zero-inflated Poisson regression models

Enhancing variational approximation (1)

Two paths of improvements to break less dependencies between the latent variables:

\[p(\mathbf{Z}_i, \mathbf{W}_i | \mathbf{Y}_i) \approx q(\mathbf{Z}_i, \mathbf{W}_i) \triangleq \left\{\begin{array}{l} \prod_j q(W_{ij} | Z_{ij}) q(Z_ {ij}) \\[1ex] \prod_j q(Z_{ij} | W_{ij}) q(W_{ij}) \\ \end{array}\right.\]

The \(W|Z,Y\) path

One can show that

\[ W_{ij}| Y_{ij},Z_{ij} \sim\mathcal B \left( \mathrm{logit}^{-1} \left( \log\left(\frac{\pi_{ij}}{1- \pi_{ij}}\right) + Z_{ij} \right)\right) \boldsymbol 1_{\{Y_{ij} = 0\}} \]

Sadly, the resulting ELBO involves the untractable entropy term \(\tilde{\mathbb{E}}[\log q_{\psi}(\mathbf{W} | \mathbf{Z})]\)

\(\rightsquigarrow\) requires computing \(\tilde{\mathbb{E}}\left[ \log\left( \mathrm{logit}^{-1}\left(U\right) \right)\mathrm{logit}^{-1}\left(U\right) \right]\) for arbitrary univariate Gaussians \(U\)

Enhancing variational approximation (2)

The \(Z|W,Y\) path

Since \(W_{ij}\) only takes two values, the dependence between \(Z_{ij}\) and \(W_{ij}\) can easily be highlighted:

\[Z_{ij} | W_{ij}, Y_{ij} = \left(Z_{ij}|Y_{ij}, W_{ij} = 1 \right)^{ W_{ij}}\left(Z_{ij}|Y_{ij}, W_{ij} = 0 \right)^{1- W_{ij}}.\] Then, \(p(Z_{ij}| Y_{ij}, W_{ij} = 1) = p(Z_{ij} | (W_{ij} = 1) = p(Z_{ij})\) by independence of \(Z_{ij}\) and \(W_{ij}\).

\(\rightsquigarrow\) Only an approximation of \(Z_{ij} | Y_{ij}, W_{ij} = 0\) is needed.

More accurate variational approximation

\[\begin{aligned} q_{\psi_i}(\boldsymbol Z_i, \boldsymbol W_i) & = q_{\psi_i}(\boldsymbol Z_i | \boldsymbol W_i) q_{\psi_i}(\boldsymbol W_i) \\ & = \otimes_{j = 1}^p \mathcal{N}(\boldsymbol x_i^\top \mathbf{B}_j, \Sigma_{jj})^{W_{ij}} \mathcal{N}(M_{ij}, S_{ij}^2)^{1-W_{ij}} W_{ij}, \quad W_{ij} \sim^\text{indep} \mathcal{B}\left(\rho_{ij}\right)\end{aligned}.\]

Counterpart

We loose close-forms in M Step of VEM for \(\hat{\mathbf{B}}\) and \(\hat{\mathbf{\Sigma}}\) in the corresponding ELBO…

Additional refinement

Optimization using analytic law of \(W_{ij}| Y_{ij}\)

Proposition 2 (Distribution of \(W_{ij}| Y_{ij}\)) \[W_{ij} | Y_{ij} \sim \mathcal{B}\left(\frac{\pi_{ij}}{ \varphi\left(\mathbf{x}_i^\top \boldsymbol B_j, \Sigma_{jj}\right) \left(1 - \pi_{ij}\right) + \pi_{ij}}\right) \boldsymbol 1_{\{Y_{ij} = 0\}}\]

with \(\varphi(\mu,\sigma^2) = \mathbb E \left[ \exp(-X)\right], ~ X \sim \mathcal L \mathcal N \left( \mu, \sigma^2\right)\).

Approximation of \(\varphi\)

The function \(\varphi\) is intractable but an approximation (Rojas-Nandayapa 2008) can be computed:

\[\varphi(\mu, \sigma^2)\approx \tilde \varphi(\mu, \sigma^2)= \frac{\exp \left(-\frac{L^2\left(\sigma^2 e^\mu\right)+2 L\left(\sigma^2 e^\mu\right)}{2 \sigma^2}\right)}{\sqrt{1+L\left(\sigma^2 e^\mu\right)}},\] where \(L(\cdot)\) is the Lambert function (i.e. \(z = x \exp(x) \Leftrightarrow x = L(z), x,z \in \mathbb R\)).

Microcosm data analysis

Fit various ZI-PLN models


Playing with the ZI component

Try various way of modeling probablity of being zero:

ZI            <- ZIPLN(Abundance ~ 1 + offset(log(Offset))           , data = microcosm)
ZI_site       <- ZIPLN(Abundance ~ 1 + offset(log(Offset)) | 0 + site,  data = microcosm)
ZI_time       <- ZIPLN(Abundance ~ 1 + offset(log(Offset)) | 0 + time,  data = microcosm)
ZI_site_time  <- ZIPLN(Abundance ~ 1 + offset(log(Offset)) | 0 + site_time,  data = microcosm)


Include covariates both in PLN and ZI components

ZI_site_PLN_site <- ZIPLN(Abundance ~ 0 + site      + offset(log(Offset)) | 0 + site,  data = microcosm)
ZI_time_PLN_time <- ZIPLN(Abundance ~ 0 + time      + offset(log(Offset)) | 0 + time,  data = microcosm)
ZI_PLN_site_time <- ZIPLN(Abundance ~ 0 + site_time + offset(log(Offset)) | 0 + site_time, data = microcosm)

Other possible combinaisons (still less effective).

Locate the best model

Model selection criteria
The top two largest are presented
model nb_param loglik BIC ICL
PLN 33929 -220146.1 -335526.4 -856215.2
PLN site 34706 -217792.1 -335814.7 -856263.3
PLN time 34706 -218444.9 -336467.6 -854845.5
PLN site * time 37555 -213942.0 -341653.1 -859927.5
ZI 33930 -216464.3 -331848.1 -594390.5
ZI site 34188 -203256.9 -319518.0 -531080.2
ZI time 34188 -204861.7 -321122.8 -578202.4
ZI site * time 34188 -194490.5 -310751.7 -546597.5
ZI site PLN site 35742 -201981.4 -323527.1 -563262.0
ZI time PLN time 35742 -206516.2 -328061.9 -534570.5
ZI and PLN site * time 41440 -191913.2 -332835.8 -500573.7

\(\rightsquigarrow\) Ok, let us keep model with site and time with main and interaction effects in both ZI an PLN components.

Latent means of the PLN component (w/wo covariate effects)

Same for standard PLN 😊1

Clustering of the latent means vs probability of zero inflation

Latent means vs probability of zero inflation

Model fits

Fits of zeros

Residual covariance

Reorder by hierarchical clustering with

Network

Inference + SBM

network <- ZIPLN(Abundance ~ 0 + site_time + offset(log(Offset)) | 0 + site_time, data = microcosm,
                 control = ZIPLN_param(penalty = 0.3, inception = ZI_PLN_site_time))
adjacency_mat <- plot(network, plot = FALSE, type = "support", output = "corrplot") %>% as.matrix()
mySBM <- sbm::estimateSimpleSBM(adjacency_mat, dimLabels = "ASV", estimOptions = list(plot=FALSE))

\(\rightsquigarrow\) To Jeanne’s work!

References

Aitchison, J., and C. H. Ho. 1989. “The Multivariate Poisson-Log Normal Distribution.” Biometrika 76 (4): 643–53.
Chiquet, Julien, Mahendra Mariadassou, and Stéphane Robin. 2021. “The Poisson-Lognormal Model as a Versatile Framework for the Joint Analysis of Species Abundances.” Frontiers in Ecology and Evolution 9. https://doi.org/10.3389/fevo.2021.588292.
Mariadassou, Mahendra, Laurent X Nouvel, Fabienne Constant, Diego P Morgavi, Lucie Rault, Sarah Barbey, Emmanuelle Helloin, et al. 2023. “Microbiota Members from Body Sites of Dairy Cows Are Largely Shared Within Individual Hosts Throughout Lactation but Sharing Is Limited in the Herd.” Animal Microbiome 5 (1): 32.
Rojas-Nandayapa, Leonardo. 2008. “Risk Probabilities: Asymptotics and Simulation.”